This notebook contains all code for parsing and analysis the protein clusters generated using MMSeq2.
A representative sequence from each of the 4 largest clusters compiled using MMSeq2 (with a percentage identity and coverage of cut-off of 70%) was extracted from a local CAZyme database using cazy_webscraper.
Each representative sequence was identified by using the GenBank accession assigned as the name of the cluster by MMSeq2.
The 4 representative sequences were analysed using BLASTP. The Python script run_blastp.py from the Python package pyrewton DOI:10.5281/zenodo.3876218) was used to run the BLASTP all-versus-all analysis.
The table compiled by BLASTP contains the following columns: - qseqid - sseqid - pident - length - mismatch - gapopen - qstart - qend - sstart - send - evalue - bitscore
In order to compare each of the pair-wise-alignment we need to calculate the Blast Score Ratio (SCR) to normalise for length.
The bitscore reported by BLAST is the sum of the qualities of the aligned symbols over the whole alignment. This is an accurate measure of the alignment strength, but long sequences tend to have higher bitscores than short sequences, even when the matches are of about the same quality. To correct for this length effect, we can calculate a normalised bitscore where:
normalised bitscore = bitscore / query length
Table 2.1 presents the raw output from BLASTP as well as the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
| qseqid | sseqid | pident | cov | qlen | slen | alen | bitscore | evalue | bsr |
|---|---|---|---|---|---|---|---|---|---|
| CBK69950.1 | CBK69950.1 | 100.000 | 100 | 539 | 539 | 539 | 1121 | 0 | 2.0797774 |
| CBK69950.1 | AGE22437.1 | 49.171 | 100 | 539 | 567 | 543 | 520 | 0 | 0.9647495 |
| CBK69950.1 | CDG29680.1 | 49.631 | 99 | 539 | 533 | 542 | 520 | 0 | 0.9647495 |
| CBK69950.1 | QJR11213.1 | 44.383 | 99 | 539 | 534 | 543 | 412 | 0 | 0.7643785 |
| QJR11213.1 | QJR11213.1 | 100.000 | 100 | 534 | 534 | 534 | 1107 | 0 | 2.0730337 |
| QJR11213.1 | CDG29680.1 | 47.532 | 99 | 534 | 533 | 547 | 477 | 0 | 0.8932584 |
| QJR11213.1 | AGE22437.1 | 49.358 | 99 | 534 | 567 | 545 | 473 | 0 | 0.8857678 |
| QJR11213.1 | CBK69950.1 | 44.954 | 99 | 534 | 539 | 545 | 417 | 0 | 0.7808989 |
| CDG29680.1 | CDG29680.1 | 100.000 | 100 | 533 | 533 | 533 | 1118 | 0 | 2.0975610 |
| CDG29680.1 | AGE22437.1 | 66.417 | 99 | 533 | 567 | 533 | 753 | 0 | 1.4127580 |
| CDG29680.1 | CBK69950.1 | 49.631 | 99 | 533 | 539 | 542 | 520 | 0 | 0.9756098 |
| CDG29680.1 | QJR11213.1 | 47.091 | 99 | 533 | 534 | 550 | 471 | 0 | 0.8836773 |
| AGE22437.1 | AGE22437.1 | 100.000 | 100 | 567 | 567 | 567 | 1179 | 0 | 2.0793651 |
| AGE22437.1 | CDG29680.1 | 66.417 | 94 | 567 | 533 | 533 | 753 | 0 | 1.3280423 |
| AGE22437.1 | CBK69950.1 | 49.171 | 94 | 567 | 539 | 543 | 520 | 0 | 0.9171076 |
| AGE22437.1 | QJR11213.1 | 49.088 | 93 | 567 | 534 | 548 | 469 | 0 | 0.8271605 |
Figure 2.1 is an interactive plot presenting the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.
To view the specific BSR for each comparison, hover over the plot and a tooltip will appear and will present the GenBank accessions of the corresponding proteins as well as the specific BSR value (to 3dp).
Figure 2.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
The cluster CBK69950.1 contained 33 protein sequences, QJR11213.1 28 protein sequences, CDG29680.1 17 protein sequences, and AGE22437.1 contaiend 13 protein sequences.
The BSR infer that the two smaller clusters (CDG29680.1 and AGE22437.1) could potentially be combined to create a large cluster. The proteins in this new combined clusters could then be aligned to create an multisequence alignment (MSA) of functionally relevant proteins for molecular modeling. Conversely, the BSR inferred the two larger clusters should be kept separate ( and not combined) to create a high quality MSA (i.e. minimal gaps and readily identifiable consensus sequences) of the proteins within each cluster.